The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.
Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas
You as a Data scientist at Thera bank need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards
You need to identify the best possible model that will give the required performance
Explore and visualize the dataset. Build a classification model to predict if the customer is going to churn or not Optimize the model using appropriate techniques Generate a set of insights and recommendations that will help the bank
Data Dictionary:
CLIENTNUM: Client number. Unique identifier for the customer holding the account
Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
Customer_Age: Age in Years
Gender: Gender of the account holder
Dependent_count: Number of dependents
Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.
Marital_Status: Marital Status of the account holder
Income_Category: Annual Income Category of the account holder
Card_Category: Type of Card
Months_on_book: Period of relationship with the bank
Total_Relationship_Count: Total no. of products held by the customer
Months_Inactive_12_mon: No. of months inactive in the last 12 months
Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
Credit_Limit: Credit Limit on the Credit Card
Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
Total_Trans_Amt: Total Transaction Amount (Last 12 months)
Total_Trans_Ct: Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
Avg_Utilization_Ratio: Represents how much of the available credit the customer spent
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Libraries to tune model, get different metric scores, and split data
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,recall_score
from sklearn import metrics
# Library to impute missing values
from sklearn.impute import KNNImputer
# Library to build a logistic regression model
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
)
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn import datasets
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# Library to supress the warning
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from sklearn.preprocessing import LabelEncoder
pd.set_option('display.max_columns', None)
data = pd.read_csv('BankChurners.csv')
df= data.copy()
df.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
np.unique(df["Attrition_Flag"])
array(['Attrited Customer', 'Existing Customer'], dtype=object)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
• There are no missing values
df.shape
(10127, 21)
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.0 | 7.391776e+08 | 3.690378e+07 | 708082083.0 | 7.130368e+08 | 7.179264e+08 | 7.731435e+08 | 8.283431e+08 |
| Customer_Age | 10127.0 | 4.632596e+01 | 8.016814e+00 | 26.0 | 4.100000e+01 | 4.600000e+01 | 5.200000e+01 | 7.300000e+01 |
| Dependent_count | 10127.0 | 2.346203e+00 | 1.298908e+00 | 0.0 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 |
| Months_on_book | 10127.0 | 3.592841e+01 | 7.986416e+00 | 13.0 | 3.100000e+01 | 3.600000e+01 | 4.000000e+01 | 5.600000e+01 |
| Total_Relationship_Count | 10127.0 | 3.812580e+00 | 1.554408e+00 | 1.0 | 3.000000e+00 | 4.000000e+00 | 5.000000e+00 | 6.000000e+00 |
| Months_Inactive_12_mon | 10127.0 | 2.341167e+00 | 1.010622e+00 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Contacts_Count_12_mon | 10127.0 | 2.455317e+00 | 1.106225e+00 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Credit_Limit | 10127.0 | 8.631954e+03 | 9.088777e+03 | 1438.3 | 2.555000e+03 | 4.549000e+03 | 1.106750e+04 | 3.451600e+04 |
| Total_Revolving_Bal | 10127.0 | 1.162814e+03 | 8.149873e+02 | 0.0 | 3.590000e+02 | 1.276000e+03 | 1.784000e+03 | 2.517000e+03 |
| Avg_Open_To_Buy | 10127.0 | 7.469140e+03 | 9.090685e+03 | 3.0 | 1.324500e+03 | 3.474000e+03 | 9.859000e+03 | 3.451600e+04 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 7.599407e-01 | 2.192068e-01 | 0.0 | 6.310000e-01 | 7.360000e-01 | 8.590000e-01 | 3.397000e+00 |
| Total_Trans_Amt | 10127.0 | 4.404086e+03 | 3.397129e+03 | 510.0 | 2.155500e+03 | 3.899000e+03 | 4.741000e+03 | 1.848400e+04 |
| Total_Trans_Ct | 10127.0 | 6.485869e+01 | 2.347257e+01 | 10.0 | 4.500000e+01 | 6.700000e+01 | 8.100000e+01 | 1.390000e+02 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 7.122224e-01 | 2.380861e-01 | 0.0 | 5.820000e-01 | 7.020000e-01 | 8.180000e-01 | 3.714000e+00 |
| Avg_Utilization_Ratio | 10127.0 | 2.748936e-01 | 2.756915e-01 | 0.0 | 2.300000e-02 | 1.760000e-01 | 5.030000e-01 | 9.990000e-01 |
• Clientnum is a unique identifer and can be dropped.
df.drop(['CLIENTNUM'],axis=1,inplace=True)
df.describe(include=['object']).T
| count | unique | top | freq | |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
• Most of the Customers surveid are current customers and have an open account. • Most of the customers completed graduate school. • Most of the customers are Married. • Most of the customers make less than $40K a year. - This seems weird since they have graduate degrees. • Most customers are blue card members.
# Utility function
# takes the numerical column as the input and returns the boxplots
def histogram_boxplot(feature, figsize=(15,10), bins = None):
""" Boxplot and histogram combined
feature: 1-d feature array
figsize: size of fig (default (9,8))
bins: number of bins (default None / auto)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(nrows = 2,
sharex = True,
gridspec_kw = {"height_ratios": (.25, .75)},
figsize = figsize
)
# creating the 2 subplots
sns.boxplot(feature, ax=ax_box2, showmeans=True, color='violet')
if bins:
sns.distplot(feature, kde=F, ax=ax_hist2, bins=bins,color = 'orange')
else:
sns.distplot(feature, kde=False, ax=ax_hist2,color='tab:cyan')
ax_hist2.axvline(np.mean(feature), color='purple', linestyle='--')
ax_hist2.axvline(np.median(feature), color='black', linestyle='-')
histogram_boxplot(data['Credit_Limit'])
• Average credit limit of a customer is ~ $8,500, Credit limit is right skewed.
histogram_boxplot(data['Months_on_book'])
• Average months a customer holds a card ~ 35 months.
"""
Utility to put percolation on bar charts
"""
def perc_on_bar(z):
total = len(data[z]) # length of the column
plt.figure(figsize=(15,5))
ax = sns.countplot(data[z],palette='Paired')
for p in ax.patches:
percentage = '{:.1f}%'.format(100 * p.get_height()/total)
x = p.get_x() + p.get_width() / 2 - 0.05 # width of the plot
y = p.get_y() + p.get_height() # hieght of the plot
ax.annotate(percentage, (x, y), size = 12)
plt.show()
perc_on_bar('Gender')
• Female and Male are more or less equal (with M < F)
perc_on_bar('Income_Category')
Most customers make less than $40K a year, with second going to 40-60K
perc_on_bar('Attrition_Flag')
• 84% are existing customers
perc_on_bar('Education_Level')
• 31% of customers hold graduate degrees, second high school only.
sns.distplot(df['Avg_Utilization_Ratio'], kde=True, rug=True, color='green')
<AxesSubplot:xlabel='Avg_Utilization_Ratio', ylabel='Density'>
• Utilization is right skewed
sns.distplot(df['Dependent_count'], kde=True, rug=True,color='green')
<AxesSubplot:xlabel='Dependent_count', ylabel='Density'>
• Customers have 2 to 3 dependents
sns.pairplot(df,hue='Attrition_Flag',palette='crest')
<seaborn.axisgrid.PairGrid at 0x7f347e9cc910>
plt.figure(figsize=(10,7))
sns.heatmap(df.corr(),annot=True,vmin=-1,vmax=1,fmt='.2g', cmap="YlGnBu")
plt.show()
• There is a 1:1 ratio for Average Open and Credit Limit, I will drop one of those.
• Months on Bomoks is highly corrolated with Age.
• Transaction amount and transaction count are also correlated.
# Drop as Avg Open as CV shows it is same as Cred Limit
df.drop(['Avg_Open_To_Buy'],axis=1,inplace=True)
# creating instance of labelencoder
labelencoder = LabelEncoder()
# Assigning numerical values and storing in another column
encodingList = ['Attrition_Flag',
'Gender','Education_Level','Marital_Status','Income_Category','Card_Category']
for i in encodingList:
df[i] = labelencoder.fit_transform(df[i])
# More meaningful to look at instead of 1 and 0 - for analysis and business
df['Attrition_Flag'].replace(1,'Existing Customer',inplace=True)
df['Attrition_Flag'].replace(0,'Attrited Customer',inplace=True)
### Function to plot stacked bar charts for categorical columns
def stacked_plot(x):
sns.set()
## crosstab
tab1 = pd.crosstab(x,df['Attrition_Flag'],margins=True).sort_values(by='Existing Customer',ascending=False)
print(tab1)
print('-'*120)
## visualising the cross tab
tab =pd.crosstab(x,df['Attrition_Flag'],normalize='index').sort_values(by='Existing Customer',ascending=False)
tab.plot(kind='bar',stacked=True,figsize=(17,7))
plt.legend(loc='lower left', frameon=False)
plt.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()
stacked_plot(df['Months_on_book'])
Attrition_Flag Attrited Customer Existing Customer All Months_on_book All 1627 8500 10127 36 430 2033 2463 37 62 296 358 34 57 296 353 38 57 290 347 40 45 288 333 31 34 284 318 39 64 277 341 35 45 272 317 33 48 257 305 41 51 246 297 32 44 245 289 30 58 242 300 42 36 235 271 28 43 232 275 43 42 231 273 29 34 207 241 45 33 194 227 44 42 188 230 27 23 183 206 26 24 162 186 46 36 161 197 47 24 147 171 48 27 135 162 25 31 134 165 24 28 132 160 49 24 117 141 23 12 104 116 56 17 86 103 22 20 85 105 21 10 73 83 50 25 71 96 53 7 71 78 51 16 64 80 13 7 63 70 20 13 61 74 19 6 57 63 52 12 50 62 54 6 47 53 18 13 45 58 55 4 38 42 17 4 35 39 16 3 26 29 15 9 25 34 14 1 15 16 ------------------------------------------------------------------------------------------------------------------------
stacked_plot(df['Months_Inactive_12_mon'])
Attrition_Flag Attrited Customer Existing Customer All Months_Inactive_12_mon All 1627 8500 10127 3 826 3020 3846 2 505 2777 3282 1 100 2133 2233 4 130 305 435 5 32 146 178 6 19 105 124 0 15 14 29 ------------------------------------------------------------------------------------------------------------------------
sns.boxplot(x='Card_Category', y='Income_Category', hue='Attrition_Flag', data = df,
palette='winter')
<AxesSubplot:xlabel='Card_Category', ylabel='Income_Category'>
df.groupby(df['Attrition_Flag']).mean()
| Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Attrition_Flag | ||||||||||||||||||
| Attrited Customer | 46.659496 | 0.428396 | 2.402581 | 3.119852 | 1.494776 | 2.924401 | 0.170252 | 36.178242 | 3.279656 | 2.693301 | 2.972342 | 8136.039459 | 672.822987 | 0.694277 | 3095.025814 | 44.933620 | 0.554386 | 0.162475 |
| Existing Customer | 46.262118 | 0.479059 | 2.335412 | 3.092118 | 1.457412 | 2.852353 | 0.181647 | 35.880588 | 3.914588 | 2.273765 | 2.356353 | 8726.877518 | 1256.604118 | 0.772510 | 4654.655882 | 68.672588 | 0.742434 | 0.296412 |
plt.figure(figsize=(10,4))
sns.distplot(df[df["Attrition_Flag"] == 'Existing Customer']['Months_on_book'], color =
'g',label='Attrition_Flag=0')
sns.distplot(df[df["Attrition_Flag"] == 'Attrited Customer']['Months_on_book'], color =
'b',label='Attrition_Flag=1')
plt.legend()
plt.title("Distribution")
Text(0.5, 1.0, 'Distribution')
cols = data[['Avg_Utilization_Ratio','Total_Revolving_Bal','Credit_Limit', 'Total_Trans_Ct']].columns.tolist()
plt.figure(figsize=(12,7))
for i, variable in enumerate(cols):
plt.subplot(3,2,i+1)
sns.boxplot(data["Attrition_Flag"],data[variable],palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
# creating instance of labelencoder
labelencoder = LabelEncoder()
# Assigning numerical values and storing in another column
encodingList = ['Attrition_Flag']
for i in encodingList:
df[i] = labelencoder.fit_transform(df[i])
columns = df.columns
len(df.columns)
19
df
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 45 | 1 | 3 | 3 | 1 | 2 | 0 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 1 | 49 | 0 | 5 | 2 | 2 | 4 | 0 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 1 | 51 | 1 | 3 | 2 | 1 | 3 | 0 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 1 | 40 | 0 | 4 | 3 | 3 | 4 | 0 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 1 | 40 | 1 | 3 | 5 | 1 | 2 | 0 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10122 | 1 | 50 | 1 | 2 | 2 | 2 | 1 | 0 | 40 | 3 | 2 | 3 | 4003.0 | 1851 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 0 | 41 | 1 | 2 | 6 | 0 | 1 | 0 | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 0 | 44 | 0 | 1 | 3 | 1 | 4 | 0 | 36 | 5 | 3 | 4 | 5409.0 | 0 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 0 | 30 | 1 | 2 | 2 | 3 | 1 | 0 | 36 | 4 | 3 | 3 | 5281.0 | 0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 0 | 43 | 0 | 2 | 2 | 1 | 4 | 3 | 25 | 6 | 2 | 4 | 10388.0 | 1961 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
10127 rows × 19 columns
# Create correlation matrix
corr_matrix = df.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features
df.drop(to_drop, axis=1, inplace=True)
df
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 45 | 1 | 3 | 3 | 1 | 2 | 0 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 1 | 49 | 0 | 5 | 2 | 2 | 4 | 0 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 1 | 51 | 1 | 3 | 2 | 1 | 3 | 0 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 1 | 40 | 0 | 4 | 3 | 3 | 4 | 0 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 1 | 40 | 1 | 3 | 5 | 1 | 2 | 0 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10122 | 1 | 50 | 1 | 2 | 2 | 2 | 1 | 0 | 40 | 3 | 2 | 3 | 4003.0 | 1851 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 0 | 41 | 1 | 2 | 6 | 0 | 1 | 0 | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 0 | 44 | 0 | 1 | 3 | 1 | 4 | 0 | 36 | 5 | 3 | 4 | 5409.0 | 0 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 0 | 30 | 1 | 2 | 2 | 3 | 1 | 0 | 36 | 4 | 3 | 3 | 5281.0 | 0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 0 | 43 | 0 | 2 | 2 | 1 | 4 | 3 | 25 | 6 | 2 | 4 | 10388.0 | 1961 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
10127 rows × 19 columns
from scipy import stats
# create a DF with columns zscore < 3 - remove noise
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
df
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 1 | 44 | 1 | 2 | 2 | 1 | 1 | 0 | 36 | 3 | 1 | 2 | 4010.0 | 1247 | 1.376 | 1088 | 24 | 0.846 | 0.311 |
| 10 | 1 | 42 | 1 | 5 | 5 | 3 | 0 | 0 | 31 | 5 | 3 | 2 | 6748.0 | 1467 | 0.831 | 1201 | 42 | 0.680 | 0.217 |
| 14 | 1 | 57 | 0 | 2 | 2 | 1 | 4 | 0 | 48 | 5 | 2 | 2 | 2436.0 | 680 | 1.190 | 1570 | 29 | 0.611 | 0.279 |
| 19 | 1 | 45 | 0 | 2 | 2 | 1 | 5 | 0 | 37 | 6 | 1 | 2 | 14470.0 | 1157 | 0.966 | 1207 | 21 | 0.909 | 0.080 |
| 20 | 1 | 47 | 1 | 1 | 1 | 0 | 2 | 0 | 42 | 5 | 2 | 0 | 20979.0 | 1800 | 0.906 | 1178 | 27 | 0.929 | 0.086 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10118 | 0 | 50 | 1 | 1 | 6 | 3 | 3 | 0 | 36 | 6 | 3 | 4 | 9959.0 | 952 | 0.825 | 10310 | 63 | 1.100 | 0.096 |
| 10119 | 0 | 55 | 0 | 3 | 5 | 2 | 5 | 0 | 47 | 4 | 3 | 3 | 14657.0 | 2517 | 0.166 | 6009 | 53 | 0.514 | 0.172 |
| 10123 | 0 | 41 | 1 | 2 | 6 | 0 | 1 | 0 | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 0 | 44 | 0 | 1 | 3 | 1 | 4 | 0 | 36 | 5 | 3 | 4 | 5409.0 | 0 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 0 | 30 | 1 | 2 | 2 | 3 | 1 | 0 | 36 | 4 | 3 | 3 | 5281.0 | 0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
8842 rows × 19 columns
from sklearn.model_selection import train_test_split
columns
Index(['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count',
'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct',
'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
dtype='object')
X = df[['Customer_Age', 'Gender', 'Dependent_count',
'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct',
'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']]
y = df['Attrition_Flag']
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(5304, 18) (1769, 18) (1769, 18)
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
LogisticRegression()
predictions = logmodel.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,predictions))
0.8903335217637083
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
DecisionTreeClassifier()
predictions = dtree.predict(X_test)
print(accuracy_score(y_test,predictions))
0.9366873940079141
feat_importances = pd.Series(dtree.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
<AxesSubplot:>
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train,y_train)
RandomForestClassifier(n_estimators=600)
predictions = rfc.predict(X_test)
print(accuracy_score(y_test,predictions))
0.9564725833804409
feat_importances = pd.Series(rfc.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
<AxesSubplot:>
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
seed = 8
kfold = model_selection.KFold(n_splits = 5)
# initialize the base classifier
base_cls = DecisionTreeClassifier()
# no. of base classifier
num_trees = 100
# bagging classifier
model = BaggingClassifier(base_estimator = base_cls,
n_estimators = num_trees,
random_state = seed)
results = model_selection.cross_val_score(model, X, y, cv = kfold)
print("accuracy :")
print(results.mean())
accuracy : 0.9131444254877235
# Import models and utility functions
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn import datasets
# Setting SEED for reproducibility
seed = 1
# Instantiate Gradient Boosting Regressor
gbr = GradientBoostingClassifier(n_estimators = 200, max_depth = 1, random_state =
seed)
# Fit to training set
gbr.fit(X_train, y_train)
# Predict on test set
predictions = gbr.predict(X_test)
# test set RMSE
test_rmse = MSE(y_test, predictions) ** (1 / 2)
# Print rmse
print('RMSE test set: {:.2f}'.format(test_rmse))
RMSE test set: 0.25
print(accuracy_score(y_test,predictions))
0.9366873940079141
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
encoder = LabelEncoder()
binary_encoded_y = pd.Series(encoder.fit_transform(y))
#X_train, X_test, y_train, y_test = train_test_split(X, binary_encoded_y,random_state=42)
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(X, binary_encoded_y,random_state=42)
print(X_train.shape, X_val.shape, X_test.shape)
(6631, 18) (2211, 18) (1769, 18)
classifier = AdaBoostClassifier(
DecisionTreeClassifier(max_depth=1),
n_estimators=200
)
classifier.fit(X_train,y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
n_estimators=200)
predictions = classifier.predict(X_test)
print(accuracy_score(y_test,predictions))
0.963256076879593
#Grid Search
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
clf = LogisticRegression()
grid_values = {'penalty': ['l1', 'l2'],'C':[0.001,.009,0.01,.09,1,5,10,25]}
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values,scoring = 'recall')
grid_clf_acc.fit(X_train, y_train)
#Predict values based on new parameters
predictions = grid_clf_acc.predict(X_test)
# New Model Evaluation metrics
print('Accuracy Score : ' + str(accuracy_score(y_test,predictions)))
Accuracy Score : 0.889768230638779
from imblearn.over_sampling import SMOTE
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train==1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train==0)))
sm = SMOTE(sampling_strategy = 1 ,k_neighbors = 5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over==1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over==0)))
print('After UpSampling, the shape of train_X: {}'.format(X_train_over.shape))
print('After UpSampling, the shape of train_y: {} \n'.format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 5548 Before UpSampling, counts of label 'No': 1083 After UpSampling, counts of label 'Yes': 5548 After UpSampling, counts of label 'No': 5548 After UpSampling, the shape of train_X: (11096, 18) After UpSampling, the shape of train_y: (11096,)
log_reg_over = LogisticRegression(random_state = 1)
# Training the basic logistic regression model with training set
log_reg_over.fit(X_train_over,y_train_over)
LogisticRegression(random_state=1)
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1)
cv_result_over=cross_val_score(estimator=log_reg_over, X=X_train_over, y=y_train_over,
scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_over)
plt.show()
• Performance of model on training set varies between 0.80 to 0.83, which is not an improvement from the previous model
• Let's check the performance on the test set
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model,train,test,train_y,test_y,flag=True):
'''
model : classifier to predict values of X
'''
# defining an empty list to store train and test results
score_list=[]
pred_train = model.predict(train)
pred_test = model.predict(test)
train_acc = model.score(train,train_y)
test_acc = model.score(test,test_y)
train_recall = metrics.recall_score(train_y,pred_train)
test_recall = metrics.recall_score(test_y,pred_test)
train_precision = metrics.precision_score(train_y,pred_train)
test_precision = metrics.precision_score(test_y,pred_test)
score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision))
# If the flag is set to True then only the following print statements will be
# dispayed. The default value is set to True.
if flag == True:
print("Accuracy on training set : ",model.score(train,train_y))
print("Accuracy on test set : ",model.score(test,test_y))
print("Recall on training set : ",metrics.recall_score(train_y,pred_train))
print("Recall on test set : ",metrics.recall_score(test_y,pred_test))
print("Precision on training set :",metrics.precision_score(train_y,pred_train))
print("Precision on test set : ",metrics.precision_score(test_y,pred_test))
return score_list # returning the list with train and test score
def make_confusion_matrix(model,y_actual,labels=[1, 0]):
'''
model : classifier to predict values of X
y_actual : ground truth
'''
y_predict = model.predict(X_test)
cm=metrics.confusion_matrix( y_actual, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
columns = [i for i in ['Predicted - No','Predicted - Yes']])
group_counts = ["{0:0.0f}".format(value) for value in
cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cm.flatten()/np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in
zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=labels,fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
#Calculating different metrics
get_metrics_score(log_reg_over,X_train_over,X_test,y_train_over,y_test)
# creating confusion matrix
make_confusion_matrix(log_reg_over,y_test)
Accuracy on training set : 0.8167808219178082 Accuracy on test set : 0.7953646127755795 Recall on training set : 0.8192141312184571 Recall on test set : 0.7914691943127962 Precision on training set : 0.8152466367713005 Precision on test set : 0.955846279640229
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state = 1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train==1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train==0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un==1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un==0)))
print('After Under Sampling, the shape of train_X: {}'.format(X_train_un.shape))
print('After Under Sampling, the shape of train_y: {} \n'.format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 5548 Before Under Sampling, counts of label 'No': 1083 After Under Sampling, counts of label 'Yes': 1083 After Under Sampling, counts of label 'No': 1083 After Under Sampling, the shape of train_X: (2166, 18) After Under Sampling, the shape of train_y: (2166,)
log_reg_under = LogisticRegression(random_state = 1)
log_reg_under.fit(X_train_un,y_train_un )
LogisticRegression(random_state=1)
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_under=cross_val_score(estimator=log_reg_under, X=X_train_un, y=y_train_un,
scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_under)
plt.show()
#Calculating different metrics
get_metrics_score(log_reg_under,X_train_un,X_test,y_train_un,y_test)
# creating confusion matrix
make_confusion_matrix(log_reg_under,y_test)
Accuracy on training set : 0.8051708217913204 Accuracy on test set : 0.79479932165065 Recall on training set : 0.7848568790397045 Recall on test set : 0.7874069058903183 Precision on training set : 0.8180943214629451 Precision on test set : 0.9595709570957096
lr = LogisticRegression(random_state=1)
lr.fit(X_train,y_train)
LogisticRegression(random_state=1)
# Choose the type of classifier.
lr_estimator = LogisticRegression(random_state=1,solver='saga')
# Grid of parameters to choose from
parameters = {'C': np.arange(0.1,1.1,0.1)}
# Run the grid search
grid_obj = GridSearchCV(lr_estimator, parameters, scoring='recall')
grid_obj = grid_obj.fit(X_train_over, y_train_over)
# Set the clf to the best combination of parameters
lr_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
lr_estimator.fit(X_train_over, y_train_over)
LogisticRegression(C=0.1, random_state=1, solver='saga')
# defining list of model
models = [lr]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
j = get_metrics_score(model,X_train,X_test,y_train,y_test,False)
acc_train.append(j[0])
acc_test.append(j[1])
recall_train.append(j[2])
recall_test.append(j[3])
precision_train.append(j[4])
precision_test.append(j[5])
# defining list of models
models = [log_reg_over, lr_estimator]
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
j = get_metrics_score(model,X_train_over,X_test,y_train_over,y_test,False)
acc_train.append(j[0])
acc_test.append(j[1])
recall_train.append(j[2])
recall_test.append(j[3])
precision_train.append(j[4])
precision_test.append(j[5])
# defining list of model
models = [log_reg_under]
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
j = get_metrics_score(model,X_train_un,X_test,y_train_un,y_test,False)
acc_train.append(j[0])
acc_test.append(j[1])
recall_train.append(j[2])
recall_test.append(j[3])
precision_train.append(j[4])
precision_test.append(j[5])
comparison_frame = pd.DataFrame({'Model':['Logistic Regression','Logistic Regression on Oversampled data', 'Logistic Regression-Regularized (Oversampled data)','Logistic Regression on Undersampled data'], 'Train_Accuracy': acc_train,'Test_Accuracy':
acc_test, 'Train_Recall':recall_train,'Test_Recall':recall_test, 'Train_Precision':precision_train,'Test_Precision':precision_test})
#Sorting models in decreasing order of test recall
comparison_frame
| Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.888855 | 0.890334 | 0.963590 | 0.964116 | 0.909029 | 0.909904 |
| 1 | Logistic Regression on Oversampled data | 0.816781 | 0.795365 | 0.819214 | 0.791469 | 0.815247 | 0.955846 |
| 2 | Logistic Regression-Regularized (Oversampled d... | 0.702145 | 0.763143 | 0.811644 | 0.799594 | 0.665829 | 0.905675 |
| 3 | Logistic Regression on Undersampled data | 0.805171 | 0.794799 | 0.784857 | 0.787407 | 0.818094 | 0.959571 |
log_odds = log_reg_under.coef_[0]
pd.DataFrame(log_odds, X_train_un.columns, columns=['coef']).T
| Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| coef | -0.084127 | 0.024116 | -0.290665 | -0.218568 | -0.145906 | -0.21857 | -0.004135 | 0.03586 | 0.076052 | -0.400268 | -0.420083 | -7.438820e-07 | 0.000845 | -0.024276 | -0.000774 | 0.159661 | 0.039938 | -0.005208 |
odds = np.exp(np.abs(log_reg_under.coef_[0]))-1
pd.set_option('display.max_rows',None)
pd.DataFrame(odds, X_train_un.columns, columns=['Change in odds']).T
| Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Change in odds | 0.087767 | 0.02441 | 0.337316 | 0.244294 | 0.157088 | 0.244296 | 0.004144 | 0.036511 | 0.079019 | 0.492224 | 0.522087 | 7.438823e-07 | 0.000845 | 0.024573 | 0.000774 | 0.173113 | 0.040746 | 0.005222 |
imputer = SimpleImputer(strategy="median")
impute = imputer.fit(X_train)
X_train = impute.transform(X_train)
X_val = imputer.transform(X_val)
X_test = imputer.transform(X_test)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Training Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_train, model.predict(X_train)) * 100
print("{}: {}".format(name, scores))
Cross-Validation Performance: Bagging: 97.3502303024395 Random forest: 98.68418102502862 GBM: 98.72026580232169 Adaboost: 97.6568615504594 Xgboost: 98.63006198263187 dtree: 95.71016823857221 Training Performance: Bagging: 99.76568132660418 Random forest: 100.0 GBM: 99.15284787310743 Adaboost: 98.05335255948089 Xgboost: 100.0 dtree: 100.0
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
We will tune - GBM, Adaboost and XGBoost and see if the performance improves.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
%%time
# defining model
# model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
param_test1 = {'n_estimators':range(20,81,10)}
grid_cv = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=500,min_samples_leaf=50,max_depth=8,max_features='sqrt',subsample=0.8,random_state=10),
param_grid = param_test1, scoring='roc_auc',n_jobs=4,cv=5)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'n_estimators': 80}
Score: 0.9891755685077189
CPU times: user 650 ms, sys: 72 ms, total: 722 ms
Wall time: 3.74 s
# building model with best parameters
gbm_tuned1 = GradientBoostingClassifier(
n_estimators=80,
#learning_rate=1,
#random_state=1,
#base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the model on training data
gbm_tuned1.fit(X_train, y_train)
GradientBoostingClassifier(n_estimators=80)
# Calculating different metrics on train set
gbm_grid_train = model_performance_classification_sklearn(
gbm_tuned1, X_train, y_train
)
print("Training performance:")
gbm_grid_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.973006 | 0.991348 | 0.976736 | 0.983988 |
# Calculating different metrics on validation set
gbm_grid_val = model_performance_classification_sklearn(gbm_tuned1, X_val, y_val)
print("Validation performance:")
gbm_grid_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.961104 | 0.989089 | 0.964875 | 0.976832 |
# creating confusion matrix
confusion_matrix_sklearn(gbm_tuned1, X_val, y_val)
%%time
# defining model
#model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
param_test1 = {'n_estimators':range(20,81,10)}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=500,min_samples_leaf=50,max_depth=8,max_features='sqrt',subsample=0.8,random_state=10),
param_distributions = param_test1, scoring=scorer,n_jobs=4,cv=5)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 80} with CV score=0.9864820997733531:
CPU times: user 605 ms, sys: 14.7 ms, total: 619 ms
Wall time: 3.17 s
# building model with best parameters
gbm_tuned2 = GradientBoostingClassifier(
n_estimators=80,
#learning_rate=1,
#random_state=1,
#base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the model on training data
gbm_tuned2.fit(X_train, y_train)
GradientBoostingClassifier(n_estimators=80)
# Calculating different metrics on train set
gbm_random_train = model_performance_classification_sklearn(
gbm_tuned2, X_train, y_train
)
print("Training performance:")
gbm_random_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.973006 | 0.991348 | 0.976736 | 0.983988 |
# Calculating different metrics on validation set
gbm_random_val = model_performance_classification_sklearn(gbm_tuned1, X_val, y_val)
print("Validation performance:")
gbm_random_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.961104 | 0.989089 | 0.964875 | 0.976832 |
# creating confusion matrix
confusion_matrix_sklearn(gbm_tuned2, X_val, y_val)
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'base_estimator': DecisionTreeClassifier(max_depth=1, random_state=1), 'learning_rate': 0.1, 'n_estimators': 10}
Score: 1.0
CPU times: user 2.01 s, sys: 1.73 s, total: 3.74 s
Wall time: 7.92 s
# building model with best parameters
adb_tuned1 = AdaBoostClassifier(
n_estimators=20,
learning_rate=1,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the model on training data
adb_tuned1.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=1, n_estimators=20, random_state=1)
# Calculating different metrics on train set
Adaboost_grid_train = model_performance_classification_sklearn(
adb_tuned1, X_train, y_train
)
print("Training performance:")
Adaboost_grid_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.982205 | 0.990988 | 0.987783 | 0.989383 |
# Calculating different metrics on validation set
Adaboost_grid_val = model_performance_classification_sklearn(adb_tuned1, X_val, y_val)
print("Validation performance:")
Adaboost_grid_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.956128 | 0.978723 | 0.968683 | 0.973677 |
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned1, X_val, y_val)
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 50, 'learning_rate': 0.01, 'base_estimator': DecisionTreeClassifier(max_depth=1, random_state=1)} with CV score=1.0:
CPU times: user 814 ms, sys: 94.6 ms, total: 909 ms
Wall time: 2.63 s
# building model with best parameters
adb_tuned2 = AdaBoostClassifier(
n_estimators=20,
learning_rate=1,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the model on training data
adb_tuned2.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=1, n_estimators=20, random_state=1)
# Calculating different metrics on train set
Adaboost_random_train = model_performance_classification_sklearn(
adb_tuned2, X_train, y_train
)
print("Training performance:")
Adaboost_random_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.982205 | 0.990988 | 0.987783 | 0.989383 |
# Calculating different metrics on validation set
Adaboost_random_val = model_performance_classification_sklearn(adb_tuned2, X_val, y_val)
print("Validation performance:")
Adaboost_random_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.956128 | 0.978723 | 0.968683 | 0.973677 |
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned2, X_val, y_val)
Removed from models as it does not completed even on 56 Cores machine. Computationally too expensive
models_train_comp_df = pd.concat(
[
Adaboost_grid_train.T,
Adaboost_random_train.T,
gbm_grid_train.T,
gbm_random_train.T,
# xgboost_grid_train.T,
# xgboost_random_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"AdaBoost Tuned with Grid search",
"AdaBoost Tuned with Random search",
"GBM Grid Search",
"GBM Random search",
#"Xgboost Tuned with Grid search",
#"Xgboost Tuned with Random Search",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| AdaBoost Tuned with Grid search | AdaBoost Tuned with Random search | GBM Grid Search | GBM Random search | |
|---|---|---|---|---|
| Accuracy | 0.982205 | 0.982205 | 0.973006 | 0.973006 |
| Recall | 0.990988 | 0.990988 | 0.991348 | 0.991348 |
| Precision | 0.987783 | 0.987783 | 0.976736 | 0.976736 |
| F1 | 0.989383 | 0.989383 | 0.983988 | 0.983988 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
Adaboost_grid_val.T,
Adaboost_random_val.T,
gbm_grid_val.T,
gbm_random_val.T,
#xgboost_grid_val.T,
#xgboost_random_val.T,
],
axis=1,
)
models_val_comp_df.columns = [
"AdaBoost Tuned with Grid search",
"AdaBoost Tuned with Random search",
"GBM Grid Tuned with Grid Search",
"GBM Grid Tuned with Random Search",
#"Xgboost Tuned with Grid search",
#"Xgboost Tuned with Random Search",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| AdaBoost Tuned with Grid search | AdaBoost Tuned with Random search | GBM Grid Tuned with Grid Search | GBM Grid Tuned with Random Search | |
|---|---|---|---|---|
| Accuracy | 0.956128 | 0.956128 | 0.961104 | 0.961104 |
| Recall | 0.978723 | 0.978723 | 0.989089 | 0.989089 |
| Precision | 0.968683 | 0.968683 | 0.964875 | 0.964875 |
| F1 | 0.973677 | 0.973677 | 0.976832 | 0.976832 |
# Calculating different metrics on the test set
gbm_grid_test = model_performance_classification_sklearn(gbm_tuned1, X_test, y_test)
print("Test performance:")
gbm_grid_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.971735 | 0.991198 | 0.97535 | 0.98321 |
feature_names = X.columns
importances = gbm_tuned2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Creating new pipeline with best parameters
dtree_tuned_GS = make_pipeline(
StandardScaler(),
DecisionTreeClassifier(criterion='entropy', max_depth= 10,max_features= 'sqrt',
max_leaf_nodes=30, min_samples_leaf=20, min_samples_split=90,
splitter='best'),
)
dtree_tuned_GS.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('decisiontreeclassifier',
DecisionTreeClassifier(criterion='entropy', max_depth=10,
max_features='sqrt', max_leaf_nodes=30,
min_samples_leaf=20,
min_samples_split=90))])
# Creating new pipeline with best parameters
gbm_tuned_p = make_pipeline(
StandardScaler(),
GradientBoostingClassifier(
n_estimators=80,
),
)
gbm_tuned_p.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('gradientboostingclassifier',
GradientBoostingClassifier(n_estimators=80))])
#Using above defined function to get accuracy, recall and precision on train and test set
gbm_tuned_p_score=get_metrics_score(gbm_tuned_p,X_train,X_test,y_train,y_test)
Accuracy on training set : 0.9730055798522094 Accuracy on test set : 0.971735443753533 Recall on training set : 0.9913482335976929 Recall on test set : 0.991198375084631 Precision on training set : 0.9767359261232463 Precision on test set : 0.9753497668221186
make_confusion_matrix(gbm_tuned_p,y_test)
• GBM did the best in the test data.
• Top three factors that effect credit card attrition: Total Transaction Amount, Total Transaction Counts, and Total Revolving Balance. So, in short, those who keep thier account, use thier credit cards a lot.
• Business should focus on getting customers to use the credit they have more often. This seems to be the best predictor of keeping a customer.
• Over 30% of customers have graduate degrees and make less than $40K a year Attrited Client Profile:¶
• Clients who are contacted a lot are 52% more likly to leave.
• Clients who are inactive during a 12 month period are 49% more likely to leave.
• Clients who have a smaller credit limit.
• Clients who have a 50% smaller revolving balance.